Football is a very exciting sport. Until now, this is the most popular game in the entire Earth planet. Sorry not sorry USA games.

I want to review data collected since 1872 trying to understand how matches between countries have evolved up to this moment. So, we are calling to R and a few libraries to help us visualizing data:

library(tidyverse)
library(plotly)

Dataset

The first thing is to read files. I downloaded this project at 2021-07-22 from Kaggle.

results<-read.csv("results.csv", encoding = "UTF-8")
shootouts<-read.csv("shootouts.csv", encoding = "UTF-8")

This dataset contains data about \(42k+\) football matches in the history of international encounters between national teams. So, let’s take a little taste of the data:

head(results)
head(shootouts)

Analysis

One interesting thing is to take a look of the context of the matches, some of them could be not relevant at all, however there is also Worl cup matches, continental tournaments, and so on:

levels(as.factor(results$tournament)) -> tournaments
sample(tournaments,20)
##  [1] "Copa Roca"                           "Atlantic Heritage Cup"              
##  [3] "Copa Lipton"                         "AFC Asian Cup qualification"        
##  [5] "Dynasty Cup"                         "VFF Cup"                            
##  [7] "Copa Premio Honor Uruguayo"          "Copa Juan Pinto Durán"              
##  [9] "CONCACAF Championship qualification" "ELF Cup"                            
## [11] "African Cup of Nations"              "Lunar New Year Cup"                 
## [13] "International Cup"                   "Confederations Cup"                 
## [15] "AFF Championship qualification"      "Copa Chevallier Boutell"            
## [17] "Brazil Independence Cup"             "AFC Challenge Cup"                  
## [19] "Copa Félix Bogado"                   "Dragon Cup"

Filtering by tournaments with at least 100 matches played in the history:

results %>% group_by(tournament) %>% summarise(count=n()) %>% filter(count > 100) %>% select(tournament) -> popularCups
results %>% filter(tournament %in% popularCups$tournament) %>% ggplot(aes(x=tournament, fill=tournament)) + geom_bar() + coord_flip() -> p 
ggplotly(p)

Now we need to process a little bit the data to assign an stadard way to provide points based on the outcome of every match:

Points Outcome
\(3\) Victory
\(1\) Tie
\(0\) Defeat

In FIFA scores, 2 points can be achieved by winning shootout after a tied match, however I ignore that for the following analysis

Let’s take a look on how it looks now:

results %>%
  mutate(tied=ifelse(home_score == away_score,TRUE,FALSE)) %>%
  mutate(home_points=ifelse(tied == TRUE,1,ifelse(home_score > away_score,3,0))) %>%
  mutate(away_points=ifelse(tied == TRUE,1,ifelse(home_score > away_score,0,3))) -> results

results %>% filter(grepl("FIFA World Cup",tournament)) -> worldCupResults
head(worldCupResults)

After this step we also need to transform a little bit the structure od this dataset in order to measure performance of each National Team in this way:

Then we can take a look on how it seems (for tournaments that contains "FIFA World Cup" in its name)

results %>% pivot_longer(c(home_team,away_team),names_to = "homeaway", values_to = "team") %>% mutate(points=ifelse(grepl("home",homeaway),home_points,away_points), goals=ifelse(grepl("home",homeaway),home_score,away_score),receivedGoals=ifelse(grepl("home",homeaway),away_score,home_score)) %>% select(date,tournament,country,team,points,goals,receivedGoals) -> results

results %>% filter(grepl("FIFA World Cup",tournament)) -> worldCupResults

worldCupResults %>% group_by(team) %>% summarise(p=sum(points),goals=sum(goals),against=sum(receivedGoals),matches=n()) %>% mutate(performance=p/matches,ofensive=goals/matches,defense=against/matches) %>% arrange(desc(performance)) %>% head()

FIFA World Cup (and qualifiers)

The most interesting matches occurs at FIFA World Cup. So the we can focus on what happens in this tournament:

worldCupResults %>% group_by(team) %>% summarise(p=sum(points),goals=sum(goals),against=sum(receivedGoals),matches=n()) %>% mutate(performance=p/matches,ofensive=goals/matches,defense=against/matches) %>% ggplot(aes(x=performance,y=ofensive,size=defense,color=matches,text=team)) + geom_point() + labs(title="Performance vs offensiveness of National Teams in matches") -> p
ggplotly(p) 

Germany emerges as the best in performance over all the matches related with the World Cup. Is not a surprise at all, remember all of the “goleadas” that has produced, in the qualifiers as well as in the knock-out matches in the final stages of the tournament.

Now we can take a look on what happens if we focus only on the final stage, I mean filering out the qualifiers:

worldCupResults %>% filter(!grepl("qualifi",tournament)) %>% group_by(team) %>% summarise(p=sum(points),goals=sum(goals),against=sum(receivedGoals),matches=n()) %>% mutate(performance=p/matches,ofensive=goals/matches,defense=against/matches) %>% ggplot(aes(x=performance,y=ofensive,size=defense,color=matches,text=team)) + geom_point() + labs(title="Performance vs offensiveness of National Teams in matches of FIFA World Cup") -> p
ggplotly(p) 

Now Brazil is the highlight. And it is not a surprise, they are also a winning machine in football. Germany got the second place (this also is not a surprise).

Another interesting thing to appreciate is the offensive of Hungary! For all of the young people, you must know that in other times they were a true fear for every other nation in football.

And yes, Italy is another highlight, they invented the legendary Catenaccio. This is a very defensive way to play with a great performance as you can see!

You can also look to the Netherlands (the eternal runner-up), and my Mexico that is the “Otra vez nos quedamos sin el quinto partido”.

Overall history of the football

It is also interesting to take a more general view of history of international football. (To forget about the unsurprising performance of my team).

results %>% group_by(team) %>% summarise(p=sum(points),goals=sum(goals),against=sum(receivedGoals),matches=n()) %>% mutate(performance=p/matches,ofensive=goals/matches,defense=against/matches) %>% ggplot(aes(x=performance,y=ofensive,size=defense,color=matches,text=team)) + geom_point() + labs(title="Performance vs offensive in all matches") -> p
ggplotly(p)

And there are a lot of surprises! Maybe they are due to the lack of competitiveness, however it is noticeable that in the center are the most popular teams, also with more played matches, and it seems that competency tends to center the behavior of the parameters selected.

So, we can take a look on how defense relate to offensive:

results %>% group_by(team) %>% summarise(p=sum(points),goals=sum(goals),against=sum(receivedGoals),matches=n()) %>% mutate(performance=p/matches,ofensive=goals/matches,defense=against/matches) %>% ggplot(aes(color=performance,x=ofensive,y=defense,text=paste(team,matches,sep="\n"))) + geom_point() + labs(title="Defense vs offensive in all matches") -> p
ggplotly(p)

Maybe we can construct a ML tool to classify teams according to the performance. There is a pattern, however there is also a lot of noise due to not so professional matches in this dataset.